0.1 Workshop objectives

  1. Understand the basis of biogeographical regionalization
  2. Quantify spatial turnover of species in gridded communities
  3. Cluster a dissimilarity matrix to delimit biologically distinct regions in space
  4. Investigate the role of climate in defining of biogeographic regions in space

0.2 Before you start

For this tutorial you will need the following R packages:

Remember that you can install packages in R with install.packages("package name")

1 Biogegraphic regionalization

Scientific aspects to define biogeographic regions

  • The characterization of geographical areas in terms of biodiversity.

  • Fundamental abstractions of life organization in the planet in response to past or current ecological forces.

2 Defining the study area (R.O.I)

First we must define our region of interest (ROI) where we want to delimit biogeographic regions.

The limits of the ROI can be placed in space, time, or both.

Some sources of information on planetary scale geographical limits.

3 Species distribution data

Data on the geographic distribution of species typically come from of biodiversity inventories in the field.

Sources of information often are:

  • Global open databases

  • National/Provincial Surveys

  • Cooperation among scientists.

Information obtained from these sources are usually stored as:

  • geocoded observations (points)
x y sp
37 -26 Sp 5
11 -7 Sp 5
8 -16 Sp 1
7 -35 Sp 1
12 -25 Sp 5
10 -6 Sp 1
36 -33 Sp 2
3 -30 Sp 3
26 -10 Sp 4
22 -37 Sp 1

  • Species ranges (polygons,raster)

https://www.iucnredlist.org/species/172220/6852079

https://www.iucnredlist.org/species/173248/6979817

4 Delineating biogeographic regions in the ROI

4.1 Obtaining species occurrences

For this tutorial we will work with data in the occurrence of briofites in North America. This dataset comes from the bryophyte collections of the University of British Columbia.

https://beatymuseum.ubc.ca/research-2/collections/herbarium/herbarium-bryophytes/

I previously downloaded the dataset from the GBIF, if you want to download it yourself it is here: https://www.gbif.org/dataset/4edd9396-59df-4b01-9e29-dc21a59f9963

briofites

briofites

Lets load the dataset

The raw dataset is extensive, and there are some data which is obvioulsy misplaced.

We must have a clean version of the dataset before we start doing any analyses on it.

Let’s start by removing all fields that have no associated geographic information.

Second, let’s set our ROI whitin the continental political boundaries of Canada and the United States of America.

1x1 degree is a commonly use standard grid area for macroecology.

Does grid resolution affect our observations of spatial patterns in biodiversity?

imgAgg

imgAgg

## [1] 30.18 82.50

  • How constant is the within-grid variation in species diversity throughout the ROI?

With our clean dataset we can proceed to build the species-site matrix Entries of this matrix show whether a species (rows) is present at a given site (columns) within the ROI. This matrix holds the information on the distribution of biodiversity in a ROI.

\[\begin{bmatrix} sites & site1 & site2 & ... & site_n \\Sp1 & 1 & 0 & ...& 1 \\Sp2 & 0 & 1 & ...& 1 \\Sp3 & 1 & 1 & ...& 0 \\Sp4 & 1 & 0 & ...& 1 \\\vdots & \vdots & \vdots & \vdots & \vdots \\Sp_j & 1 & 1 & 1 & 0 \end{bmatrix}\]

Note we only care whether a species is present or absent from a site. In this tutorial, we don’t focus in the relative abundance of species.

Biogeographic regions are defined based on their dissimilarity of biological communities with other biogeographic regions. The deliniation of biogeographic regions is based on clustering the variance in community composition (i.e. turnover) of a certain taxon or taxa throughout the ROI. Therefore, we must find a way to compute how different each cell of our biofrite grid is from all others in the ROI.

Dissimilarites arise fromthe ratio of the number of shared species relative to a total of the species diversity. This measure of turnover is representative of the ß-component of diversity. There are several community dissimilarity metrics For this tutorial we will use the Simpson’s dissmilarity index as measure of speciess turnover among communities.

\[ S_i = \frac{a}{a + min(b,c)} \]

being \(a\) the number of shared species and \(min(b,c)\) the minimun number of species between two communities. Simpson dissimilarity index uses only the poorest community in the denominator to account for large differences in richness between communities.

Let’s define a function to compute this index and apply it to our data to obtain a dissimilarity matrix .

Lets observe the variance of the dissimilarity distance matrix, how many clusters can you observe by eyeballing a heatmap?

This distance matrix contains the multidimensional variation in biodiversity turnover. Each dimension represents the gradients of dissimilarity from a focal grid.

Unless we are interested in the decay of biodiversity turnover from a focal point having many dimensions is not so informative. How we can find more general patterns in space?

5 Linearizing multidimensionality

5.1 Continous

5.1.1 Dimensionality reduction analysis

Ordination are a set of statistical techniques used principally to reduce the dimensionality of a matrix. In our case we can think about sites and species as dimensions. Using this vector we can construct the most parsimonious organization in 2d (or more) space.

We will use a technique called Non-Metric Multidimensional (NMDS) scaling to find the most parsimonious arragement of our distance-matrix items that preserve the community wide pattern of pairwise distances between species and sites.

We will not to go into details of the procedure, but if you need a reminder of ordination methods or the mathematical basis of NMDS you can find information in the following links:

Ordination:

http://ordination.okstate.edu/overview.htm

https://wiki.qcbs.ca/r_workshop9

NMDS:

http://www.flutterbys.com.au/stats/tut/tut15.1.html

http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

## Run 0 stress 0.153179 
## Run 1 stress 0.1531917 
## ... Procrustes: rmse 0.001691191  max resid 0.02920649 
## Run 2 stress 0.1531902 
## ... Procrustes: rmse 0.002890082  max resid 0.05996124 
## Run 3 stress 0.153199 
## ... Procrustes: rmse 0.002215938  max resid 0.02818661 
## Run 4 stress 0.1534809 
## ... Procrustes: rmse 0.008229608  max resid 0.1735261 
## Run 5 stress 0.1531988 
## ... Procrustes: rmse 0.00274696  max resid 0.03522366 
## Run 6 stress 0.153177 
## ... New best solution
## ... Procrustes: rmse 0.001439533  max resid 0.02576394 
## Run 7 stress 0.1532257 
## ... Procrustes: rmse 0.004433792  max resid 0.08379174 
## Run 8 stress 0.1532045 
## ... Procrustes: rmse 0.003716143  max resid 0.06877719 
## Run 9 stress 0.1531941 
## ... Procrustes: rmse 0.002639719  max resid 0.02661373 
## Run 10 stress 0.1531749 
## ... New best solution
## ... Procrustes: rmse 0.006086831  max resid 0.1197146 
## Run 11 stress 0.1532273 
## ... Procrustes: rmse 0.007744202  max resid 0.1634501 
## Run 12 stress 0.1532057 
## ... Procrustes: rmse 0.006399744  max resid 0.1350686 
## Run 13 stress 0.1532509 
## ... Procrustes: rmse 0.008322298  max resid 0.1761186 
## Run 14 stress 0.1531838 
## ... Procrustes: rmse 0.006148429  max resid 0.129468 
## Run 15 stress 0.1533899 
## ... Procrustes: rmse 0.007470562  max resid 0.1567682 
## Run 16 stress 0.1531927 
## ... Procrustes: rmse 0.00530156  max resid 0.1097783 
## Run 17 stress 0.1532665 
## ... Procrustes: rmse 0.00889758  max resid 0.1892822 
## Run 18 stress 0.1532023 
## ... Procrustes: rmse 0.007697044  max resid 0.1634441 
## Run 19 stress 0.1531988 
## ... Procrustes: rmse 0.006778018  max resid 0.1429533 
## Run 20 stress 0.1533374 
## ... Procrustes: rmse 0.003336497  max resid 0.06724175 
## *** No convergence -- monoMDS stopping criteria:
##      9: no. of iterations >= maxit
##     11: stress ratio > sratmax
## species scores not available

5.2 Categorical

5.3 K-means clustering

k-means is a fast clustering algorithm. The k-means algorithm searchs to partition the variance of a matrix into groups of k clusters centered around k means.

This is common methodology for *un-supervised" classfication tasks. That is, we let the data to tell us the best partition into separate classes.

steps for the K-means algorithm (from wikipedia)

steps for the K-means algorithm (from wikipedia)

K-means algorithm assumes euclidean distances to compute k clusters. Simpson dissimilarity index index does not project linearly into euclidean space. The Hellinger standarization is a workaround which let’s us use euclidean metrics with community dissimilairity metrics.

\[y_{i,j} = \sqrt{\frac{y_{i,j}}{y_i}}\]

The main limitation of the k-means algorithm is that the cluster association vector is always dependant on the values of k. How do we know the best k then?. This therefore becomes an optimization problem for the value of k.

A solution to find the best k is to interate the k-means algorithm for a series of k within a range of reasonable estimate for the number of clusters given our community matrix. (e.g k << S or k < S not k = S or k > S)

We optimize the ratio between the variance within k given clusters in relation with the total amount of variation in the matrix. We select the k parameter which the sum of squares of the distances between k clusters (i.e. the distances of the k-centroids to its mean centroid) is closest to the sum of squares of the whole matrix.

Lets observe the change on the sum of squares and the similar Calinski criterion after applying the k-means for a range of \(k={2,3,4,5,6,7,8,9,10}\)

##         KM2 KM3 KM4 X1   X2
## 31_-90    1   2   4 31  -90
## 32_-106   1   2   4 32 -106
## 33_-106   1   2   4 33 -106
## 33_-107   1   2   4 33 -107
## 33_-108   1   2   4 33 -108
## 34_-78    1   2   4 34  -78
## Warning in RColorBrewer::brewer.pal(n, pal): minimal value for n is 3, returning requested palette with 3 different levels

6 Investigating the environmental drivers of biogeographic regionalization

##        df       AIC
## model   4 -36888.89
## model1  5 -36904.64
## model2  4  38546.29
## model3  5  38520.50
## model4  5  35514.97
## Analysis of Variance Table
## 
## Response: log(colPC1 + 10)
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## climSpace$BIO1      1  2.010  2.0104  417.28 < 2.2e-16 ***
## climSpace$Bio12     1  4.398  4.3977  912.80 < 2.2e-16 ***
## Residuals       14769 71.154  0.0048                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Carter et al., 2016. Endemic regions of mosses in North America

Carter et al., 2016. Endemic regions of mosses in North America